In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.options.display.max_rows = 8
Data is often stored in CSV files or databases in so-called “stacked” or “record” format:
In [2]:
df = pd.DataFrame({'subject': ['A', 'A', 'B', 'B'],
                   'treatment': ['CH', 'DT', 'CH', 'DT'],
                   'concentration': range(4)},
                  columns=['subject', 'treatment', 'concentration'])
df
Out[2]:
A better representation might be one where the different subjects are in rows, the applied treatments are in columns, and the outcomes are in the data frame values.
You can achieve this with the pivot method:
In [3]:
pivoted = df.pivot(index='subject', columns='treatment', values='concentration')
pivoted
Out[3]:
If there is more than one record for each pair of "subject" and "treatment" (for example, the subject was tested twice with the same treatment at different times), you can use pivot_table. It works just like pivot, but it additionally lets you specify an aggregation function ('mean' by default).
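For instance, a minimal sketch (the duplicated measurements below are made up for illustration; plain pivot would raise an error on them):
In [ ]:
df2 = pd.DataFrame({'subject': ['A', 'A', 'B', 'B'],
                    'treatment': ['CH', 'CH', 'DT', 'DT'],
                    'concentration': [1.0, 3.0, 2.0, 4.0]})
# duplicate ('subject', 'treatment') pairs are aggregated with the default 'mean'
df2.pivot_table(index='subject', columns='treatment', values='concentration')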
To take another example, we will use some data from expeditions to the Pole of Inaccessibility. We will read the data from an SQL database.
In [4]:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///data/survey.db')
In [5]:
visited = pd.read_sql('Visited', engine, index_col='ident', parse_dates=['dated'])
visited
Out[5]:
In [6]:
readings = pd.read_sql('Survey', engine).dropna()
readings = readings.drop_duplicates()
readings
Out[6]:
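As a sketch of where reshaping takes us (assuming the usual survey.db schema, in which 'Survey' has columns 'taken', 'person', 'quant' and 'reading', and 'taken' refers to the 'ident' of a visit), we could attach the visit date to each reading and spread the measured quantities over columns:
In [ ]:
# look up the date of each visit, then put each measured quantity in its own column
combined = readings.join(visited['dated'], on='taken')
combined.pivot_table(index='dated', columns='quant', values='reading')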
A hierarchical index in pandas is a way of introducing another dimension to a (two-dimensional) data frame. It is implemented by having multiple levels of the index. Let's look at an example.
In [9]:
multi = df.set_index(['subject', 'treatment'])
multi
Out[9]:
Note how the two indexes are nested: the second-level index ('treatment') is grouped under the first-level index ('subject'). To access rows you can use a label from the first level alone, or labels from both levels as a tuple.
In [10]:
multi.loc['A'] # first level only
Out[10]:
Note that this creates a standard data frame with a "flat" index.
In [11]:
multi.loc[('A', 'CH')] # both levels
Out[11]:
Indexing on the second level only is slightly more involved:
In [12]:
multi.loc[(slice(None), 'CH'), :]
Out[12]:
Consult the documentation for other methods.
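For example, the xs method selects a cross-section from a given level; this sketch is equivalent to the slice(None) selection above:
In [ ]:
# select all rows labelled 'CH' in the 'treatment' level
multi.xs('CH', level='treatment')
Note that xs drops the selected level by default (pass drop_level=False to keep it).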
To return to the original format, with columns instead of indexes, use reset_index:
In [13]:
multi.reset_index()
Out[13]:
stack — shifts the last level of hierarchical columns to rows
unstack — does the opposite, i.e. shifts the last level of hierarchical rows to columns
In [15]:
result = multi['concentration'].unstack()
result
Out[15]:
stack reverses the operation:
In [16]:
result.stack()
Out[16]:
We can "stack" it even further:
In [17]:
df = multi.stack()
df
Out[17]:
Going further, let's apply these techniques in a time series case study using data from AirBase (the European Air quality dataBase).
One of the raw data files downloaded from AirBase is included in the repo:
In [19]:
!head -1 ./data/BETR8010000800100hour.1-1-1990.31-12-2012
Just reading the tab-delimited data:
In [20]:
data = pd.read_csv("data/BETR8010000800100hour.1-1-1990.31-12-2012", sep='\t')
data.head()
Out[20]:
The above data is clearly not ready to be used! Each row contains the 24 measurements for each hour of the day, as well as a flag (0/1) indicating the quality of the data.
Let's replace the negative numbers with missing values and give the columns proper names.
In [21]:
hours = map(str, range(24))
flags = ['flag'] * 24
# interleave hour and flag names: ['date', '0', 'flag', '1', 'flag', ...]
col_names = ['date'] + list(sum(zip(hours, flags), ()))
col_names[:5]
Out[21]:
In [22]:
data = pd.read_csv("data/BETR8010000800100hour.1-1-1990.31-12-2012", sep='\t',
                   na_values=['-999', '-9999'],
                   names=col_names,
                   index_col='date')
For now, we disregard the 'flag' columns:
In [23]:
data = data.drop('flag', axis=1)
data.head()
Out[23]:
Now, we want to reshape it: our goal is to have the different hours as row indices, merged with the date into a datetime-index.
We can now use stack and some other functions to create a time series from the original dataframe:
The end result should look like:
| | BETR801 |
|---|---|
| 1990-01-02 09:00:00 | 48.0 |
| 1990-01-02 12:00:00 | 48.0 |
| 1990-01-02 13:00:00 | 50.0 |
| 1990-01-02 14:00:00 | 55.0 |
| ... | ... |
| 2012-12-31 20:00:00 | 16.5 |
| 2012-12-31 21:00:00 | 14.5 |
| 2012-12-31 22:00:00 | 16.5 |
| 2012-12-31 23:00:00 | 15.0 |

170794 rows × 1 columns
First, reshape the dataframe so that each row consists of one observation for one date + hour combination:
In [ ]:
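A possible solution sketch (assuming the hour columns are the strings '0' to '23' defined above; the name 'BETR801' is taken from the station code in the file name):
In [ ]:
data_stacked = data.stack()                             # Series with a (date, hour) MultiIndex
data_stacked = data_stacked.reset_index(name='BETR801') # move the index levels back to columns
data_stacked = data_stacked.rename(columns={'level_1': 'hour'})
data_stacked.head()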
Now, combine the date and hour columns into a datetime (tip: string columns can be summed to concatenate the strings):
In [ ]:
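One way to do this, following the tip (a sketch assuming the dates in the file are formatted as YYYY-MM-DD):
In [ ]:
# concatenate the date and hour strings, then parse them into a DatetimeIndex
combined = data_stacked['date'] + ' ' + data_stacked['hour'] + ':00'
data_stacked.index = pd.to_datetime(combined, format='%Y-%m-%d %H:%M')
data_stacked = data_stacked.drop(['date', 'hour'], axis=1)
data_stacked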
© 2015, Stijn Van Hoey and Joris Van den Bossche (mailto:stijnvanhoey@gmail.com, mailto:jorisvandenbossche@gmail.com).
© 2015, modified by Bartosz Teleńczuk (original sources available from https://github.com/jorisvandenbossche/2015-EuroScipy-pandas-tutorial)
Licensed under CC BY 4.0 Creative Commons